ABSTRACT



A path link passing speech recognition system and method 
recognizes input connected speech. The recognition system 
has a plurality of vocabulary nodes associated with word 
representation models, at least one of the vocabulary nodes 
of the network being able to process more than one path link 
simultaneously, so allowing for more than one recognition 
result. 
PATH LINK PASSING SPEECH
RECOGNITION WITH VOCABULARY NODE 
BEING CAPABLE OF SIMULTANEOUSLY 
PROCESSING PLURAL PATH LINKS 

CROSS-REFERENCE TO RELATED 
APPLICATION 

This application is a continuation-in-part of our copend- 
ing commonly assigned application Ser. No. 08/094,268 
filed Jul. 21, 1993. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to speech processing and in 
particular to a system for processing alternative parses of 
connected speech. 

2. Related Art 

Speech processing includes speaker recognition, in which
the identity of a speaker is detected or verified, and speech
recognition. Speech recognition includes so-called speaker
independent recognition, wherein a system may be used by
anyone without requiring recogniser training, and so-called
speaker dependent recognition, in which the users allowed to
operate a system are restricted and a training phase is
necessary to derive information from each allowed user. It is
common in recognition processing to input speech data,
typically in digital form, to a so-called front-end processor,
which derives from the stream of input speech data a more
compact, perceptually significant set of data referred to as a
front-end feature set or vector. For example, speech is
typically input via a microphone, sampled (e.g. at 8 kHz),
digitised, segmented into frames of length 10-20 ms and, for
each frame, a set of coefficients is calculated.
In speech recognition, the speaker is normally assumed to be 
speaking one of a known set of words or phrases. A stored 
representation of the word or phrase, known as a template or 
model, comprises a reference feature matrix of that word as 
previously derived from, in the case of speaker independent 
recognition, multiple speakers. The input feature vector is 
matched with the model and a measure of similarity between 
the two is produced. 
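
The front-end steps described above (sampling, framing, one coefficient set per frame) can be sketched as follows. This is a minimal illustration only: the 8 kHz rate and 20 ms frame length are taken from the ranges quoted above, while the function names and the use of log energy as the single per-frame coefficient are assumptions made for the example and are not the feature set of any particular recogniser.

```python
import math

def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Cut a sample sequence into non-overlapping fixed-length frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 160 samples at 8 kHz, 20 ms
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def log_energy(frame):
    """Stand-in per-frame coefficient (assumption): log mean squared amplitude."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log(mean_sq + 1e-12)  # small offset avoids log(0) on silence

def front_end(samples):
    """Return one feature vector (here a single coefficient) per frame."""
    return [[log_energy(frame)] for frame in split_into_frames(samples)]
```

A real front end would compute a richer coefficient set per frame; only the frame-by-frame structure is the point here.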

Speech recognition (whether human or machine) is sus- 
ceptible to error and may result in the misrecognition of 
words. If a word or phrase is incorrectly recognised, the 
speech recogniser may then offer another attempt at 
recognition, which may or may not be correct. 

Various ways have been suggested for processing speech 
to select the best or alternative matches between input 
speech and stored speech templates or models. In isolated 
word recognition systems, the production of alternative 
matches is fairly straightforward: each word is a separate 
'path' in a transition network representing the words to be
recognised and the independent word paths join only at the 
final point in the network. Ordering all the paths exiting the 
network in terms of their similarity to the stored templates 
or the like will give the best and alternative matches. 

In most connected recognition systems and some isolated 
word recognition systems based on connected recognition 
techniques however, it is not always possible to recombine 
all the paths at the final point of the network and thus neither 
the best nor alternative matches are directly obtainable from 
the information available at the exit point of the network. 
One solution to the problem of producing a best match is 
discussed in "Token Passing: a Simple Conceptual Model 
for Connected Speech Recognition Systems" by S. J. Young, 
N. H. Russell and J. H. S. Thornton 1989, which relates to 
passing packets of information, known as tokens, through a
transition network. A token contains information relating to
the partial path travelled as well as an accumulated score
indicative of the degree of similarity between the input and
the portion of the network processed thus far.

As described by Young et al, at each input of a frame of
speech to a transition network, any tokens that are present at
the input of a node are passed into the node and the current
frame of speech matched within the word models associated
with those nodes. New tokens then appear at the output of
the nodes (having "travelled" through the model associated
with the node). Only the best scoring token is then passed
onto the inputs of the following nodes. When the end of
speech has been signalled (by an external device such as a
pause detector), a single token will be present at the final
node. From this token the entire path through the network
can be extracted by tracing back along the path by means of
the previous path information contained within the token to
provide the best match to the input speech.
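
The single-best behaviour described above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the method of Young et al: each node is treated as a single self-looping state rather than a multi-state word model, scores are treated as "higher is better", and the names `Token`, `token_passing`, `trace_back` and the `match` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Token:
    score: float              # accumulated similarity (higher is better)
    node: Optional[str]       # node this token currently occupies
    prev: Optional["Token"]   # token held by the preceding node on entry

def trace_back(token):
    """Recover the node sequence by following the previous-token pointers."""
    path = []
    while token is not None and token.node is not None:
        path.append(token.node)
        token = token.prev
    return path[::-1]

def token_passing(frames, predecessors, match, start, final):
    """predecessors: node -> nodes feeding it; match(node, frame) -> score."""
    tokens = {}  # node -> the single best token it holds
    for t, frame in enumerate(frames):
        new_tokens = {}
        for node in predecessors:
            candidates = []
            if node in tokens:                    # token stays in this node
                candidates.append((tokens[node].score, tokens[node].prev))
            for p in predecessors[node]:          # token arrives from a predecessor
                if p in tokens:
                    candidates.append((tokens[p].score, tokens[p]))
            if node == start and t == 0:          # empty token enters the network
                candidates.append((0.0, None))
            if candidates:
                # Only the best scoring token survives at each node.
                best_score, best_prev = max(candidates, key=lambda c: c[0])
                new_tokens[node] = Token(best_score + match(node, frame),
                                         node, best_prev)
        tokens = new_tokens
    return tokens.get(final)
```

Because each node keeps one token, exactly one path can be read back from the final node, which is the limitation the following paragraphs address.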

The article "A unified direction mechanism for automatic
speech recognition using Hidden Markov Models" by S. C.
Austin and F. Fallside, ICASSP 1989, Vol. 1, pages
667-670, relates to a connected word speech recogniser
which operates in a manner similar to that described by
Young et al, as described above. A history relating to the
progress of the recognition through the transition network is
updated on exiting the word model. At the end of
recognition, the result of recognition is derived from the
history presented to the output which has the best score.
Again only one history is possible for each path terminating
at the final node.

Such known arrangements do not allow for an alternative 
choice to be readily available at the output of the network. 

SUMMARY OF THE INVENTION

In accordance with the invention a path link passing
speech recognition system for recognising input connected
speech comprises means for deriving recognition feature
data from an input speech signal, processing means for
modelling expected input speech and for comparing the
recognition feature data with the modelled expected input
speech, the processing means having a plurality of
vocabulary nodes associated with word representation models, and
means for indicating recognition of the input speech signal
in dependence upon the comparison, characterised in that at
least one of the vocabulary nodes can process more than one
path link simultaneously.

Such an arrangement means that more than one incoming
path link can be processed by a node at a given time and
hence that more than one recognition result may be obtained.

The modelling means preferably comprises a transition
network containing a plurality of noise nodes and
vocabulary nodes which are associated with word representation
models. The nodes are capable of producing path links
comprising fields for storing a pointer to the previous path
link, an accumulated score for a path, a pointer to a previous
node and a time index for segmentation information.
Preferably, the vocabulary nodes capable of processing more
than one path link have more than one identical associated
word representation model.
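
A path link with the four fields listed above, and a node that accepts as many incoming path links as it has identical model copies, might be sketched as follows. The class and function names are hypothetical, node identities are represented by plain strings for brevity, and "higher score is better" is an assumption of the example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PathLink:
    prev_link: Optional["PathLink"]  # pointer to the previous path link
    score: float                     # accumulated score for the path
    prev_node: Optional[str]         # pointer to the node previously exited
    time_index: int                  # frame index, kept for segmentation

def admit(incoming: List[PathLink], n_models: int) -> List[PathLink]:
    """Rank incoming links; one is admitted per identical model copy."""
    ranked = sorted(incoming, key=lambda link: link.score, reverse=True)
    return ranked[:n_models]

def nodes_on_path(link: Optional[PathLink]) -> List[str]:
    """Trace back through the previous-link pointers to list the nodes exited."""
    names = []
    while link is not None:
        if link.prev_node is not None:
            names.append(link.prev_node)
        link = link.prev_link
    return names[::-1]
```

With `n_models` equal to 1 this reduces to the single-best case; larger values let a node carry alternatives forward.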

The provision that at least one of the vocabulary nodes of
the network has more than one associated word
representation model allows the speech recogniser to process multiple
paths at the same time and so allows more than one path link
to be propagated across each inter-node link at each input
frame. In effect, the invention creates multiple layers of a
transition network along which several alternative paths may
be propagated. The best scoring path may be processed by
the first model of a node, the next best by the second and so
on until either parallel models or incoming paths run out.

In general terms "network" includes directed acyclic
graphs (DAGs) and trees. A DAG is a network with no
cycles and a tree is a network in which the only meeting of
paths occurs conceptually right at the end of the network.
The term "word" here denotes a basic recognition unit,
which may be a word but equally well may be a diphone,
phoneme, allophone, etc. Recognition is the process of
matching an unknown utterance with a predefined transition
network, the network having been designed to be compatible
with what a user is likely to say.

In order to identify the phrase that has been recognised,
the system may include means for tracing the path link back
through the network.

Alternatively, the system may also include means for
assigning a signature to at least some of the nodes having
associated word representation models and means for
comparing the signature of each path, to determine the path with
the best match to the input speech and that with the second
best alternative match.

This arrangement allows for an alternative which is
necessarily different in character to the best match and does
not differ merely in segmentation or noise matches.

The word representation models may be Hidden Markov
Models (HMMs) as described generally in British Telecom
Technology Journal, April 1988, Vol. 6, no. 2, page 105:
Cox, "Hidden Markov Models for automatic speech
recognition: theory and application", templates, Dynamic Time
Warping models or any other suitable word representation
model. The processing which occurs within a model is
irrelevant as far as this invention is concerned.

It is not necessary for all the nodes having associated
word models to have a signature assigned to them.
Depending on the structure of the transition network, it may be
sufficient only to assign signatures to those nodes which
appear before a decision point within a network. A decision
point as used herein relates to a point in the network which
has more than one incoming path.

Partial paths may be examined at certain decision points
in the network, certain constraints being imposed at these
decision points so that only paths conforming to the
constraints are propagated, as described in the applicants'
International patent application filed on Mar. 31, 1994
entitled "Connected Speech Recognition", claiming priority
from European applications Nos. 93302539.7 and
93304503.1, corresponding to copending U.S. patent
application Ser. No. 08/530,170 filed Oct. 11, 1995 naming
Smyth et al as inventors. Each decision point is associated
with a set of valid signatures and any path links with
signatures that are not in the set are discarded.

The accumulated signature may be used to identify the
complete path, resulting in extra efficiency of operation as
the path links need not be traversed to determine the path
identity, and the partial path information of the token may
not be generated at all. In this case the signature field must
be large enough to identify all paths uniquely.

For efficient operation of the system according to the
invention, the signal processing of path signatures is
preferably carried out in a single operation to increase
processing speed.

Other aspects and preferred embodiments of the invention
are as disclosed and claimed herein, with advantages that
will be apparent hereafter.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described, by way of
example only, with reference to the accompanying drawings
in which:

FIG. 1 shows schematically the employment of a
recognition processor according to the invention in a
telecommunications environment;

FIG. 2 is a block diagram showing schematically the
functional elements of a recognition processor according to
the invention;

FIG. 3 is a block diagram indicating schematically the
components of a classifier forming part of FIG. 2;

FIG. 4 is a block diagram showing schematically the
structure of a sequence parser forming part of the embodiment of
FIG. 2;

FIG. 5 shows schematically the content of a field within
a store forming part of FIG. 4;

FIG. 6 is a schematic representation of one embodiment
of a recognition network applicable with the processor of the
sequence parser of FIG. 4;

FIG. 7a shows a node of a network and FIG. 7b shows a
path link as employed according to the invention;

FIGS. 8 to 10 show the progression of path links through
the network of FIG. 6;

FIG. 11 is a schematic representation of a second
embodiment of a network of a system according to the invention;
and

FIG. 12 is a schematic representation of a third
embodiment of a network of a system according to the invention.

DETAILED DESCRIPTION OF EXEMPLARY
EMBODIMENTS

Referring to FIG. 1, a telecommunications system
including speech recognition generally comprises a microphone 1,
typically forming part of a telephone handset, a
telecommunications network (typically a public switched
telecommunications network (PSTN)) 2, a recognition processor 3,
connected to receive a voice signal from the network 2, and
a utilising apparatus 4 connected to the recognition
processor 3 and arranged to receive therefrom a voice recognition
signal, indicating recognition or otherwise of a particular
word or phrase, and to take action in response thereto. For
example, the utilising apparatus 4 may be a remotely
operated banking terminal for effecting banking transactions.

In many cases, the utilising apparatus 4 will generate an
auditory response to the speaker, transmitted via the network
2 to a loudspeaker 5 typically forming a part of the
subscriber handset.

In operation, a speaker speaks into the microphone 1 and
an analog speech signal is transmitted from the microphone
1 into the network 2 to the recognition processor 3, where
the speech signal is analysed and a signal indicating
identification or otherwise of a particular word or phrase is
generated and transmitted to the utilising apparatus 4, which
then takes appropriate action in the event of recognition of
the speech.

Typically, the recognition processor needs to acquire data
concerning the speech against which to verify the speech
signal, and this data acquisition may be performed by the
recognition processor in a second mode of operation in
which the recognition processor 3 is not connected to the
utilising apparatus 4, but receives a speech signal from the
microphone 1 to form the recognition data for that word or
phrase. However, other methods of acquiring the speech
recognition data are also possible.
Typically, the recognition processor 3 is ignorant of the
route taken by the signal from the microphone 1 to and
through the network 2; any one of a large variety of types
and qualities of receiver handset may be used. Likewise, within the
network 2, any one of a large variety of transmission paths
may be taken, including radio links, analog and digital paths
and so on. Accordingly, the speech signal Y reaching the
recognition processor 3 corresponds to the speech signal S
received at the microphone 1, convolved with the transfer
characteristics of the microphone 1, link to network 2,
channel through the network 2, and link to the recognition
processor 3, which may be lumped and designated by a
single transfer characteristic H.

Referring to FIG. 2, the recognition processor 3 comprises
an input 31 for receiving speech in digital form (either from
a digital network or from an analog to digital converter), a
frame processor 32 for partitioning the succession of digital
samples into a succession of frames of contiguous samples;
a feature extractor 33 for generating from a frame of samples
a corresponding feature vector; a classifier 34 receiving the
succession of feature vectors and operating on each with a
plurality of model states, to generate recognition results; a
sequencer 35 which is arranged to receive the classification
results from the classifier 34 and to determine the
predetermined utterance to which the sequence of classifier output
indicates the greatest similarity; and an output port 38 at
which a recognition signal is supplied indicating the speech
utterance which has been recognised.

Frame Generator 32

The frame generator 32 is arranged to receive speech
samples at a rate of, for example, 8,000 samples per second,
and to form frames comprising 256 contiguous samples, at
a frame rate of 1 frame every 16 ms. Preferably, each frame
is windowed (i.e. the samples towards the edge of the frame
are multiplied by predetermined weighting constants) using,
for example, a Hamming window to reduce spurious
artifacts generated by the frame edges. In a preferred
embodiment, the frames are overlapping (for example by
50%) so as to ameliorate the effects of the windowing.

Feature Extractor 33

The feature extractor 33 receives frames from the frame
generator 32 and generates, in each case, a set or vector of
features. The features may, for example, comprise cepstral
coefficients (for example, LPC cepstral coefficients or mel
frequency cepstral coefficients as described in "On the
Evaluation of Speech Recognisers and Databases using a
Reference System", Chollet & Gagnoulet, 1982 proc. IEEE
p2026), or differential values of such coefficients
comprising, for each coefficient, the differences between the
coefficient and the corresponding coefficient value in the
preceding vector, as described in "On the use of
Instantaneous and Transitional Spectral Information in Speaker
Recognition", Soong & Rosenberg, 1988 IEEE Trans. on
Acoustics, Speech and Signal Processing Vol 36 No. 6 p871.
Equally, a mixture of several types of feature coefficients
may be used.

The feature extractor 33 outputs a frame number,
incremented for each successive frame. The output of the feature
extractor 33 is also passed to an end pointer 36, the output
of which is connected to the classifier 34. The end pointer 36
detects the end of speech and various types are well known
in this field.

The frame generator 32 and feature extractor 33 are, in
this embodiment, provided by a single suitably programmed
digital signal processor (DSP) device (such as the Motorola
DSP 56000, or the Texas Instruments TMS C 320) or similar
device.

Classifier 34

Referring to FIG. 3, in this embodiment, the classifier 34
comprises a classifying processor 341 and a state memory
342.

The state memory 342 comprises a state field 3421,
3422, . . . , for each of the plurality of speech states. For
example, each allophone to be recognised by the recognition
processor comprises 3 states, and accordingly 3 state fields
are provided in the state memory 342 for each allophone.

The classification processor 34 is arranged to read each
state field within the memory 342 in turn, and calculate for
each, using the current input feature coefficient set, the
probability that the input feature set or vector corresponds to
the corresponding state.

Accordingly, the output of the classification processor
is a plurality of state probabilities P, one for each state in the
state memory 342, indicating the likelihood that the input
feature vector corresponds to each state.

The classifying processor 341 may be a suitably
programmed digital signal processing (DSP) device, and may in
particular be the same digital signal processing device as the
feature extractor 33.

Sequencer 35

Referring to FIG. 4, the sequencer 35 in this embodiment
comprises a state sequence memory 352, a parsing processor
351, and a sequencer output buffer 354.

Also provided is a state probability memory 353 which
stores, for each frame processed, the state probabilities
output by the classifier processor 341. The state sequence
memory 352 comprises a plurality of state sequence fields
3521, 3522, . . . , each corresponding to a word or phrase
sequence to be recognised consisting of a string of
allophones.

Each state sequence in the state sequence memory 352
comprises, as illustrated in FIG. 5, a number of states P_1, P_2,
. . . P_N (where N is a multiple of 3) and for each state, two
probabilities; a repeat probability (P_i1) and a transition
probability to the following state (P_i2). The states of the
sequence are a plurality of groups of three states each
relating to a single allophone. The observed sequence of
states associated with a series of frames may therefore
comprise several repetitions of each state P_i in each state
sequence model 3521 etc; for example:

[example state sequence not reproduced in the source text]

The parsing processor 351 is arranged to read, at each
frame, the state probabilities output by the classifier
processor 341, and the previous stored state probabilities in the
state probability memory 353, and to calculate the most
likely path of states to date over time, and to compare this
with each of the state sequences stored in the state sequence
memory 352.

The calculation employs the well known HMM, as
discussed in the above referenced Cox paper. Conveniently, the
HMM processing performed by the parsing processor 351
uses the well known Viterbi algorithm. The parsing
processor 351 may, for example, be a microprocessor such as
the Intel(TM) i-486(TM) microprocessor or the Motorola(TM)
68000 microprocessor, or may alternatively be a DSP device
(for example, the same DSP device as is employed for any
of the preceding processors).

Accordingly for each state sequence (corresponding to a
word, phrase or other speech sequence to be recognised) a
probability score is output by the parser processor 351 at
each frame of input speech. For example the state sequences
may comprise the names in a telephone directory. When the
end of the utterance is detected, a label signal indicating
the most probable state sequence is output from the parsing
processor 351 to the output port 38, to indicate that the
corresponding name, word or phrase has been recognised.

The parsing processor 351 comprises a network which is
specifically configured to recognise certain phrases or words,
for example a string of digits.

FIG. 6 shows a simple network for recognising a string of
words, in this case either a string of four words or a string
of three. Each node 12 of the network is associated with a
word representation model 13, for example a HMM, which
is stored in a model list. Several nodes can be associated
with each model and each node includes a pointer to its
associated model (as can be seen in FIGS. 6 and 7a). In order
to produce a best match and a single alternative parse, the
final node 14 is associated with two models so allowing this
node to process two paths. If n parses are required, the final
node 14 of the network is associated with n identical word
models.

As shown in FIG. 7b, a path link 15 contains information
relating to a pointer to the previous path link, an
accumulated score, a pointer to the node previously exited and a
time index. At the start of an utterance, an empty path link
15' is inserted into the first node 16 as shown in FIG. 8. The
first node now contains a path link and is therefore active
whereas the remaining nodes are inactive. At each clock tick
(i.e. with each incoming frame of speech) any active nodes
accumulate a score in their path link.

If the first model can match, say, a minimum of seven
frames of speech, then at the seventh clock pulse a path link
15" is output from the first node with the score for matching
the seven frames to the model and pointers to the entry path
link and the node just matched. The path link is fed to all of
the following nodes 12, as shown in FIG. 9. Now the first
three nodes are active. The input frame of speech is then
matched in the models associated with the active nodes and
new path links outputted.

This processing continues, with the first node 16
producing further path links as its model matches increasingly
longer parts of the utterance and the succeeding nodes
performing similar calculations.

When the input speech has been processed as far as the
final node 18 of the network, path links from each 'branch'
of the network may be presented to this node 18. If, at any
given time frame, there is a single path link (i.e. only one of
the parallel paths has been completed) that path link is taken
to be the best (and only) match and is processed by the final
node 18. However, if there are two path links presented to
the final node 18, both are processed by that node, since the
final node 18 is able to process more than one path. The
output path links are continuously updated at each frame of
speech. When the utterance is completed there will be two
path links 15'" at the output of the network, as shown in FIG.
10 (from which the pointers to previous path links and nodes
have been excluded for the sake of clarity).

The full path can be found by following the pointers to the
previous path links and the nodes on the recognised path
(and hence the input speech deemed to be recognised) can be
identified by looking at the pointers to the nodes exited.

FIG. 11 represents a second embodiment of a network
configured to recognise strings of three digits. The grey
nodes 22 are null nodes in the network; the white nodes are
active nodes which may be divided into vocabulary nodes 24
with associated word representation models (not shown), for
matching incoming speech and noise nodes 25 which
represent arbitrary noise.

If all of the active nodes 24, 25 after and including the
[...] node [...] are capable of processing three paths
(i.e. each vocabulary node 24 is associated with three word
representation models), the output of the network will
comprise path links relating to the three top scoring paths of the
system. As described with reference to FIGS. 8 to 10, the
three paths can be found by following, for each path, the
pointers to the previous path links. The nodes on the paths
(and hence the input speech deemed to be recognised) can be
identified by looking at the pointers to the exited nodes.

In a further development of the invention, the path links
may be augmented with signatures which represent the
significant nodes of the network. These significant nodes
may, for example, include all vocabulary nodes 24. In the
embodiment of FIG. 11, each vocabulary node 24 is
assigned a signature, for example the nodes representing the
digit 1 are assigned a signature '1', the nodes 24"
representing the digit 2 are assigned a signature '2' and so on.

At the start of parsing a single empty path link 15 is passed
into a network entry node 26. Since this is a null node, the
path link is passed to the next node, a noise node 25. The
input frame is matched in the noise model (not shown) of
this node and an updated path link produced at the output.
This path link is then passed to the next active nodes i.e. the
first vocabulary nodes 24 having an associated model (not
shown). Each vocabulary node 24 processes the frame of
speech in its associated word model and produces an
updated path link. The signature field of the path link is also
updated. At the end of each time frame, the updated path
links are sorted to retain the three (n) top paths
which have different signature fields. A list ordered by score
is maintained with the added constraint that accumulated
signatures are unique: if a second path link with the same
signature enters, the better of the two is retained. The list
contains only the top "n" different paths, the rest being
ignored.

The n path links are passed through the next null node
22' to the following noise node 25 and vocabulary nodes 24",
each of which is associated with three identical word
representation models. After this, model processing takes
place, resulting in the updating of the lists of path links and
the extending of the paths into further nodes 24'", 25. It
should be clear that the signature fields of the path links are
not updated after processing by the null nodes 22 or the
noise nodes 25 since these nodes do not have assigned
signatures.

The path links are propagated along paths which pass
through the remaining active nodes to produce, at an output
node 28, up to three path links indicating the relative scores
and signatures, for example 1 2 1, of the paths taken through
the network. The path links are continuously updated until
the end of speech is detected (for example by an external
device such as a pause detector or until a time out is
reached). At this point, the pointers or the accumulated
signatures of the path links at the output node 28 are
examined to determine the recognition results.

For example, presuming that the following three path
links are presented to the output node 28 at some time
instant:

PATH    SCORE    SIGNATURE
A       10       1 2 2
B       9        1 2 2
C       7        1 3 2

Path A, the highest scoring path, is the best match. However,
although path B has the second best score, it would be
rejected as an alternative parse since its signature, and hence
the speech deemed to be recognised, is the same as path A.
Path C would therefore be retained as the second best parse.
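
The selection rule applied to paths A, B and C above (keep the best scoring path link per signature, then take the top n) can be sketched as follows. The function name and the tuple layout are hypothetical choices made for the example; the scores and signatures are those of the three paths just discussed.

```python
def n_best_distinct(path_links, n):
    """path_links: iterable of (name, score, signature); higher score is better."""
    best_by_signature = {}
    for name, score, signature in path_links:
        held = best_by_signature.get(signature)
        # If a link with the same signature is already held, keep the better one.
        if held is None or score > held[1]:
            best_by_signature[signature] = (name, score, signature)
    ranked = sorted(best_by_signature.values(), key=lambda p: p[1], reverse=True)
    return ranked[:n]

links = [("A", 10, (1, 2, 2)), ("B", 9, (1, 2, 2)), ("C", 7, (1, 3, 2))]
result = n_best_distinct(links, 2)  # A is the best match, C the alternative
```

Path B never reaches the output list because it shares its signature with the higher scoring path A.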

If the strings to be recognised have more structure than
that discussed above, for example spelt names, signatures
need only be assigned to nodes just before decision points,
rather than at every vocabulary node. FIG. 12 shows a
network for recognising the spelling of the names "Phil",
"Paul" and "Peter". For simplicity, no noise is illustrated.
The square nodes 44 indicate where the signature should be
augmented.

The system can distinguish between the 'PHI' and 'PAU'
paths at the 'L' node because the signatures of the path links
created at the previous nodes are different. The following
node 47 will be able to distinguish between all three
independent paths since the signatures of the square nodes 44 are
different. Only the 'L' node and the final noise node 48 need
to be associated with more than one identical word model,
so that these models are capable of processing more than one
path.

In all cases, each network illustrating the speech to be 
recognised requires analysis to determine which nodes are to 
be assigned signatures. In addition the network is configured 
to be compatible with what a user is likely to say. 

Savings in memory size and processing speed may be 
achieved by limiting the signatures that a node will 
propagate, as described in the applicants' International 
patent application filed on Mar. 31, 1994 entitled "Connected 
Speech Recognition", claiming priority from European 
application No. corresponding to co-pending U.S. 
patent application Ser. No. 08/530,170 filed Oct. 11, 1995 
naming Smyth et al as inventors, incorporated herein by this 
reference. For instance, suppose that the only valid input 
speech to a recogniser having the network of FIG. 6 is one of 
the following four numbers: 111, 112, 121, 211. Certain nodes 
within the network are associated with a set of valid signatures, 
and a path will only be propagated by such a 'constrained' node 
if a path link having one of these signatures is presented. To 
achieve this, the signature fields of the path links entering a 
constrained node, e.g. the third null node 22', are examined. If 
the signature field contains a signature other than 1 or 2, the 
path link is discarded and the path is not propagated any further. 
If an allowable path link is presented, it is passed on to the 
next node. The next constrained node is the null node 22" 
after the next vocabulary nodes. This null node is constrained 
to propagate only path links having a signature 11, 
12 or 21. The null node 22"' after the next vocabulary nodes 
is constrained to propagate only path links having the 
signature 111, 112, 121 or 211. Such an arrangement significantly 
reduces the necessary processing and allows for a 
saving in the memory capacity of the apparatus. Only some 
of the nodes at decision points in the network need to be so 
constrained. In practice, a 32 bit signature has proved to be 
suitable for sequences of up to 9 digits. A 64 bit signature 
appears suitable for a 12 character alphanumeric string. 
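The constrained-node behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the described apparatus: the `PathLink` class, the helper names, the vocabulary of digits 1 to 3, the dummy scores and the rule that a signature is built by decimal concatenation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Sketch only. The data layout, helper names, digit vocabulary, scores and
# the decimal-concatenation signature rule are assumptions for illustration.


@dataclass
class PathLink:
    score: float                           # accumulated recognition score
    signature: int                         # accumulated signature
    previous: Optional["PathLink"] = None  # reference to the previous path link


def extend(link, digit, score):
    """Create the path link that follows `link` through a digit vocabulary node."""
    return PathLink(link.score + score, link.signature * 10 + digit, link)


def allowed_prefixes(valid_sequences, length):
    """Signatures a constrained node may propagate after `length` vocabulary nodes."""
    return {int(str(seq)[:length]) for seq in valid_sequences}


def constrained_propagate(links, allowed):
    """Discard every path link whose signature is not in the allowed set."""
    return [link for link in links if link.signature in allowed]


VALID_SEQUENCES = {111, 112, 121, 211}
DIGITS = (1, 2, 3)  # hypothetical vocabulary at each stage

links = [PathLink(score=0.0, signature=0)]
survivors_per_stage = []
for stage in (1, 2, 3):
    candidates = [extend(link, d, -0.1 * d) for link in links for d in DIGITS]
    links = constrained_propagate(candidates, allowed_prefixes(VALID_SEQUENCES, stage))
    survivors_per_stage.append(sorted(link.signature for link in links))
```

With these assumptions the three constrained nodes pass the signatures {1, 2}, then {11, 12, 21}, then {111, 112, 121, 211}, so discarded paths are never extended. If signatures were indeed formed by decimal concatenation, the stated field sizes would be consistent with simple arithmetic: the largest 9 digit value is below 2^32, and 36^12 (twelve alphanumeric characters) is below 2^64.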

End of speech detection and various other aspects of 
speech recognition relevant to the present invention are 
more fully set out in the applicants' International Patent 
Application filed on Mar. 25, 1994 entitled "Speech 
Recognition", claiming priority from European patent 
application No. corresponding to copending U.S. patent 
application Ser. No. 08/525,730 filed Dec. 19, 1995 naming 
Power et al as inventors, which is herein incorporated by this 
reference. 

In the above described embodiments, recognition process- 
ing apparatus suitable to be coupled to a telecommunications 
exchange has been described. However, in another 
embodiment, the invention may be embodied on simple 
apparatus connected to a conventional subscriber station 
(mobile or fixed) connected to the telephone network; in this 
case, analog to digital conversion means may be provided 
for digitising the incoming analog telephone signal. 

What is claimed is: 

1. A speech recognition system comprising: 

means for deriving a recognition feature vector from an 
input speech signal for each of predetermined time 
frames; 

means for modelling predetermined possible input speech 
comprising a plurality of vocabulary nodes each of 
which has an associated word representation model and 
links between said vocabulary nodes; 

processing means for comparing the recognition feature 
vectors with the modelled input speech and for gener- 
ating a path link for each node and time frame, said path 
links indicating the most likely prior sequence of 
vocabulary nodes for each vocabulary node and time 
frame, each path link comprising a field for storing an 
accumulated recognition score and a field for storing a 
reference to the most likely previous path link in the 
sequence; and 

means for indicating recognition of the input speech 
signal in dependence upon the comparison; 

the processing means being capable of processing more 
than one path link for at least one vocabulary node, 
other than the final node, in a single time frame. 

2. A speech recognition system as in claim 1 wherein at 
least one of the vocabulary nodes is associated with more 
than one identical word representation model. 

3. A speech recognition system as in claim 2 wherein the 
word representation models are Hidden Markov Models. 

4. A speech recognition system as in claim 1 wherein all 
the vocabulary nodes have signatures assigned to them. 

5. A speech recognition system as in claim 1 wherein only 
the vocabulary nodes occurring before a decision point have 
signatures assigned to them. 

6. A speech recognition system as in claim 4 wherein the 
path links include an accumulated signature. 

7. A speech recognition system as in claim 4 wherein at 
least some of the nodes are constrained only to propagate 
path links having certain predetermined signatures. 

8. A speech recognition system as in claim 4 wherein the 
recognition indicating means includes means for comparing 
the said recognition score and signature of the path links to 
determine the path with the best match to the input con- 
nected speech and those with the next best alternative 
matches. 

9. A method of speech recognition comprising: 
deriving a recognition feature vector from an input speech 
signal for each of predetermined time frames; 
modelling predetermined possible input speech; 
comparing the feature data with the modelled input 
speech by generating a network containing a plurality 
of vocabulary nodes associated with word representation 
models and generating a path link for each node 
and time frame, the path link indicating the most likely 
prior sequence of vocabulary nodes for each vocabu- 
lary node and time frame, each path link comprising a 
field for storing an accumulated recognition score and 
a field for storing a reference to the most likely previous 
path link in the sequence; and 
indicating recognition of the speech in dependence upon 
the comparison wherein more than one path link is 
processed in a single time frame for at least one 
vocabulary node other than the final node. 

10. A method as in claim 9 wherein the at least one of the 
vocabulary nodes is associated with more than one identical 
word representation model. 

11. A method as in claim 10 wherein at least one of the 
vocabulary nodes is associated with a number of identical 
word representation models equal to the number of desired 
recognition results. 

12. A method as in claim 10 wherein the said recognition 
scores of the path links are compared at each decision point 
of the network, only the n top scoring path links being 
propagated to the next node(s). 

13. A method as in claim 10 wherein signatures are 
assigned to all the vocabulary nodes. 

14. A method as in claim 10 wherein signatures are only 
assigned to vocabulary nodes occurring before a decision 
point in the network. 

15. A method as in claim 12 wherein signatures are 
assigned to all the vocabulary nodes and the signatures of the 
path links are also compared, only path links having different 
signatures being propagated to the next node(s). 

16. A method as in claim 13 wherein at least some nodes 
are constrained only to pass path links having certain 
predetermined signatures in their signature fields. 

17. A method as in claim 9 wherein the input speech signal 
deemed to be recognised is determined by tracing the path 
links back through the network. 

18. A method as in claim 13 wherein the input speech 
signal deemed to be recognised is determined by the accu- 
mulated signature of each path link. 

19. A method as in claim 10 wherein the best scoring path 
link is processed by the first word representation model of a 
vocabulary node, the next best by the second and so on until 
either parallel models or incoming path links run out. 

***** 